The initial goal of this project was to perform an efficient analysis of a large set of clathrate CIF files to explore the relationship between lattice parameters and interatomic distances. However, initial results revealed significant outliers.
This report documents the final, successful workflow, which evolved to handle the complexities of real-world crystallographic data, including site-occupancy disorder and structural variations. The analysis now incorporates a dynamic, category-based filtering pipeline that correctly processes a manually curated dataset of 695 CIF files.
The curated dataset is organized into subdirectories within a main `Verified` folder. Each subdirectory's name defines the specific filtering rules for the files it contains (e.g., `6c-16i-24k` or `6c-16i-24k+M_on_24k`).
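The folder-name convention can be decoded with plain string operations; this is the same parsing applied later in the processing pipeline:

```r
# Decode a category folder name into its filtering rules
category <- "6c-16i-24k+M_on_24k"

# A "+M_on_" suffix flags guest atoms sitting on a framework site
exclude_guests <- grepl("\\+M_on_", category)     # TRUE

# The part before the suffix lists the network Wyckoff sites
wyckoff_part <- gsub("\\+M_on_.*", "", category)  # "6c-16i-24k"
target_sites <- strsplit(wyckoff_part, "-")[[1]]  # "6c" "16i" "24k"
```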
The first step is to find all CIF files within the category folders, accounting for potential inconsistencies in file extensions (e.g., `.cif` vs. `.CIF`), and to map each file to its structural category.
# --- Timing Start ---
timing_load <- system.time({
  # Define the root directory containing the categorized folders
  cif_root_dir <- "/Users/donngo/repos/ml-crystals/notebook/workflow/Verified"
  if (!dir.exists(cif_root_dir)) {
    stop(paste(
      "CRITICAL ERROR: The root directory was not found at:",
      cif_root_dir
    ))
  }
  category_dirs <- list.dirs(path = cif_root_dir,
                             full.names = TRUE,
                             recursive = FALSE)

  # Create a file map that is robust to file extension case
  file_map_list <- lapply(category_dirs, function(dir) {
    full_paths <- list.files(
      path = dir,
      pattern = "\\.cif$",
      full.names = TRUE,
      ignore.case = TRUE
    )
    if (length(full_paths) == 0) return(NULL)
    data.table(
      file_path = full_paths,
      file_name = basename(full_paths),  # Base filename for later lookup
      category  = basename(dir)
    )
  })
  file_map <- rbindlist(file_map_list, fill = TRUE)
  all_cif_paths <- if (nrow(file_map) > 0) file_map$file_path else character(0)
}) # --- Timing End ---
cat(
"Found",
length(all_cif_paths),
"total CIF files to analyze across",
length(category_dirs),
"categories.\n"
)
## Found 702 total CIF files to analyze across 10 categories.
if (length(all_cif_paths) == 0) {
stop("Execution stopped: No CIF files were found. Please check your 'Verified' folder.")
}
To significantly improve performance, we replace the high-level `analyze_cif_files()` wrapper with a targeted, manual pipeline. This custom loop runs only the essential functions required for our analysis (extracting coordinates and calculating distances), skipping unnecessary computations like bond angles and error propagation.
# --- Timing Start ---
timing_analysis <- system.time({
  # This function runs only the essential calculations for a single file
  process_single_cif_for_distances <- function(file_path) {
    cif_content <- read_cif_files(file_path)[[1]]

    # Extract only what we need
    metrics <- extract_unit_cell_metrics(cif_content)
    coords  <- extract_atomic_coordinates(cif_content)
    sym_ops <- extract_symmetry_operations(cif_content)

    # Stop if essential data is missing
    if (is.null(coords) || is.null(sym_ops) || is.null(metrics)) {
      return(NULL)
    }

    # Perform only the necessary geometric calculations
    full_cell  <- apply_symmetry_operations(coords, sym_ops)
    super_cell <- expand_transformed_coords(full_cell)
    distances  <- calculate_distances(coords, super_cell, metrics)

    # Return a lean data.table with just the required results
    data.table(
      file_name          = basename(file_path),
      unit_cell_metrics  = list(metrics),
      atomic_coordinates = list(coords),
      distances          = list(distances)
    )
  }

  # Apply this lean function to all files, skipping any that fail
  results_list <- lapply(all_cif_paths, function(path) {
    tryCatch(
      process_single_cif_for_distances(path),
      error = function(e) {
        warning(paste("Failed to process", basename(path), ":", e$message))
        NULL
      }
    )
  })
  analysis_results <- rbindlist(results_list, fill = TRUE)
}) # --- Timing End ---
cat("Optimized analysis complete. The raw results table has",
nrow(analysis_results),
"rows.\n")
## Optimized analysis complete. The raw results table has 702 rows.
cat("\n--- Optimized Core Performance ---\n")
##
## --- Optimized Core Performance ---
print(timing_analysis)
## user system elapsed
## 94.470 1.329 96.034
This section implements the sophisticated filtering pipeline. For each file, the script reads its assigned category and applies a specific sequence of filters before calculating the final weighted average network distance. This chunk now calculates both weighted and unweighted averages in a single pass for efficiency.
The steps for each file are:

1. Parse Category Rules: Determine the target Wyckoff sites and whether guest atoms need to be excluded, based on the folder name.
2. Filter Guest Atoms (Conditional): For structures with non-network atoms on framework sites, remove all distances involving those specific guest atoms.
3. Filter "Ghost" Distances: Remove non-physical, short distances that arise from modeling multiple atoms on the same disordered site.
4. Identify Bonds: From the cleaned distances, identify the physically bonded atoms using the Minimum Distance Method.
5. Calculate Average Bond Lengths: Using only the identified bonds, calculate both the weighted and a simple unweighted average bond length.
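The `minimum_distance()` helper used for bond identification is defined elsewhere in the project. As a rough sketch of the Minimum Distance Method, assuming a `data.table` of pair distances with columns `Atom1`, `Atom2`, and `Distance`, bonded contacts can be taken as those within a small window above each atom's shortest contact (the `window` value here is purely illustrative):

```r
library(data.table)

# Illustrative sketch only: the project's minimum_distance() may differ.
minimum_distance_sketch <- function(distances, window = 0.1) {
  # For each atom, the shortest contact defines its bond length;
  # keep every contact within `window` (in Å) of that minimum.
  distances[, .SD[Distance <= min(Distance) + window], by = Atom1]
}

toy <- data.table(
  Atom1    = c("Si1", "Si1", "Si1", "Si2"),
  Atom2    = c("Si2", "Si3", "Na1", "Si3"),
  Distance = c(2.37, 2.41, 3.90, 2.35)
)
minimum_distance_sketch(toy)  # drops the 3.90 Å Si1-Na1 contact
```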
# --- Timing Start ---
timing_processing <- system.time({
  guest_atoms <- c("Na", "K", "Rb", "Cs", "Sr", "Ba", "Eu")
  all_removed_ghosts <- list()

  process_file_results <- function(i) {
    current_filename <- analysis_results$file_name[i]
    if (is.null(current_filename)) return(NULL)

    # Step 1: Parse the category rules from the folder name
    category <- file_map[file_name == current_filename, category]
    if (length(category) == 0) return(NULL)
    exclude_guests <- grepl("\\+M_on_", category)
    wyckoff_part <- gsub("\\+M_on_.*", "", category)
    target_wyckoff_symbols <- strsplit(wyckoff_part, "-")[[1]]

    distances <- analysis_results$distances[[i]]
    coords <- analysis_results$atomic_coordinates[[i]]
    if (is.null(distances) || is.null(coords)) return(NULL)

    # Step 2: Conditionally filter out guest atoms
    if (exclude_guests) {
      distances <- filter_by_elements(distances, coords, guest_atoms)
    }

    # Step 3: Filter ghost distances
    ghost_filter_result <- filter_ghost_distances(distances, coords, tolerance = 0.4)
    cleaned_distances <- ghost_filter_result$kept
    removed_table <- ghost_filter_result$removed
    if (nrow(removed_table) > 0) {
      removed_table[, file := current_filename]
      # Side-effect assignment: collect removed rows in the enclosing scope
      all_removed_ghosts[[length(all_removed_ghosts) + 1]] <<- removed_table
    }

    # Step 4: Identify bonded pairs from the cleaned distances
    bonded_pairs <- minimum_distance(cleaned_distances)

    # Keep only bonds within the specified network
    site_info <- coords[, .(Label, FullWyckoff = paste0(WyckoffMultiplicity, WyckoffSymbol))]
    network_atom_labels <- site_info[FullWyckoff %in% target_wyckoff_symbols, Label]
    network_bonds <- bonded_pairs[Atom1 %in% network_atom_labels]
    if (nrow(network_bonds) == 0) return(NULL)

    # Step 5: Calculate both averages using ONLY the network bonds
    weighted_avg <- calculate_weighted_average_network_distance(
      network_bonds, coords, target_wyckoff_symbols
    )
    unweighted_avg <- mean(network_bonds$Distance, na.rm = TRUE)
    if (is.na(weighted_avg)) return(NULL)

    data.table(
      file = current_filename,
      category = category,
      lattice_parameter_a = analysis_results$unit_cell_metrics[[i]]$`_cell_length_a`,
      weighted_distance = weighted_avg,
      unweighted_distance = unweighted_avg
    )
  }

  plot_data_list <- lapply(seq_len(nrow(analysis_results)), process_file_results)
  plot_data <- rbindlist(plot_data_list, use.names = TRUE, fill = TRUE)
  plot_data <- na.omit(plot_data)
  removed_ghosts_summary <- rbindlist(all_removed_ghosts, fill = TRUE)
}) # --- Timing End ---
# Final summary output
cat("Data processed. Final plot table has", nrow(plot_data), "entries.\n")
## Data processed. Final plot table has 702 entries.
cat(
nrow(removed_ghosts_summary),
"non-physical 'ghost' distances were identified and removed.\n"
)
## 443 non-physical 'ghost' distances were identified and removed.
This section provides a summary of all interatomic distances that were automatically filtered out. This allows for transparent verification of the filtering step, ensuring that the final average bond lengths are computed only from physically plausible atomic separations.
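The `filter_ghost_distances()` function belongs to the project's toolkit and is not reproduced here. A minimal sketch of the idea, ignoring the occupancy information the real function takes from `coords`: any pair separation below the tolerance (0.4 Å in this workflow) cannot be a real bond and must come from mutually exclusive split-site positions, so it is split off into a `removed` table for inspection:

```r
library(data.table)

# Simplified sketch; the real filter_ghost_distances() also uses coords.
filter_ghosts_sketch <- function(distances, tolerance = 0.4) {
  is_ghost <- distances$Distance < tolerance
  list(
    kept    = distances[!is_ghost],
    removed = distances[is_ghost]
  )
}

toy <- data.table(
  Atom1    = c("Ge1", "Ge1a"),
  Atom2    = c("Ge2", "Ge1b"),
  Distance = c(2.45, 0.12)  # 0.12 Å: two split positions of one site
)
res <- filter_ghosts_sketch(toy)
nrow(res$removed)  # 1
```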
if (exists("removed_ghosts_summary") &&
nrow(removed_ghosts_summary) > 0) {
datatable(
removed_ghosts_summary,
caption = "Summary of Removed Ghost Distances",
rownames = FALSE,
extensions = 'Buttons',
options = list(
pageLength = 10,
dom = 'Bfrtip',
buttons = c('copy', 'csv')
),
colnames = c(
"Atom 1",
"Atom 2",
"Removed Distance (Å)",
"Minimum Allowed (Å)",
"File"
)
)
} else {
cat("No 'ghost' distances were detected during the analysis.")
}
The plot below visualizes the final, cleaned data, using the weighted average bond length. Each point represents a single clathrate structure, colored by its structural category.
# Generate the final interactive plot using the weighted distance
p <- ggplot(
plot_data,
aes(
x = lattice_parameter_a,
y = weighted_distance,
color = category,
text = file
)
) +
geom_point(alpha = 0.8, size = 2.5) +
geom_smooth(
aes(group = 1),
method = "lm",
se = FALSE,
color = "black",
linetype = "dotted",
fullrange = TRUE
) +
labs(
title = "Average Network Bond Length vs. Lattice Parameter 'a'",
subtitle = "Data points are colored by their structural category",
x = "Lattice Parameter a (Å)",
y = "Average Network Bond Length (Å)",
color = "Structural Category"
) +
theme_bw(base_size = 14) +
theme(legend.position = "bottom",
legend.title = element_text(face = "bold"))
interactive_plot <- ggplotly(p, tooltip = c("x", "y", "text", "color"))
interactive_plot
The plot reveals a strong, positive linear correlation between the lattice parameter a and the average network bond length; this is the expected physical behavior. The rigorous, category-based filtering pipeline has resolved the outliers observed initially, and the data now form a much cleaner trend line.
To illustrate the impact of weighting by site multiplicity, this section provides a direct comparison between the weighted average bond length and a simple unweighted average of the same underlying bonds.
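The exact weighting performed by `calculate_weighted_average_network_distance()` lives in the project code; assuming it weights each bond by the Wyckoff multiplicity of its site, the two averages differ as follows (the `Multiplicity` column below is a hypothetical stand-in for whatever the real function looks up from `coords`):

```r
library(data.table)

# Hypothetical bonds: one from a 6c site, one from a 24k site
bonds <- data.table(
  Distance     = c(2.40, 2.45),
  Multiplicity = c(6, 24)
)

weighted   <- bonds[, sum(Distance * Multiplicity) / sum(Multiplicity)]
unweighted <- bonds[, mean(Distance)]

round(weighted, 3)    # 2.44  (pulled toward the high-multiplicity site)
round(unweighted, 3)  # 2.425
```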
# --- 1. Reshape the data (already calculated) for plotting ---
plot_data_melted <- melt(
plot_data,
id.vars = c("file", "lattice_parameter_a", "category"),
measure.vars = c("weighted_distance", "unweighted_distance"),
variable.name = "Average_Type",
value.name = "Distance"
)
# --- 2. Generate the comparison plot ---
p_compare <- ggplot(
plot_data_melted,
aes(
x = lattice_parameter_a,
y = Distance,
color = Average_Type,
text = file
)
) +
geom_point(alpha = 0.7, size = 2) +
geom_smooth(method = "lm",
se = FALSE,
aes(linetype = Average_Type)) +
labs(
title = "Comparison of Weighted vs. Unweighted Average Bond Lengths",
subtitle = "Illustrates the effect of weighting by site multiplicity",
x = "Lattice Parameter a (Å)",
y = "Average Bond Length (Å)",
color = "Average Type",
linetype = "Average Type"
) +
theme_bw(base_size = 14) +
theme(legend.position = "bottom",
legend.title = element_text(face = "bold"))
# Render as an interactive plot
ggplotly(p_compare, tooltip = c("x", "y", "color", "text"))
This table provides the final, processed data used in the plots above. It can be searched and sorted, and the data can be exported to CSV or Excel for further external analysis.
datatable(
plot_data,
caption = "Final Processed Data for Clathrate Structures",
rownames = FALSE,
filter = 'top',
extensions = 'Buttons',
options = list(
pageLength = 15,
dom = 'Bfrtip',
buttons = c('copy', 'csv', 'excel')
),
colnames = c(
"File",
"Category",
"Lattice 'a' (Å)",
"Weighted Dist (Å)",
"Unweighted Dist (Å)"
)
)
This final section saves the key results—the interactive plot and the final data table—as standalone HTML files for easy sharing and exploration.
# Create a directory to store the HTML files
export_dir <- "interactive_report_files"
dir.create(export_dir, showWarnings = FALSE)
# --- 1. Save the interactive plot ---
plot_filename <- file.path(export_dir, "interactive_clathrate_plot.html")
saveWidget(interactive_plot, file = plot_filename, selfcontained = TRUE)
cat(paste0("Interactive plot saved to '", plot_filename, "'\n"))
## Interactive plot saved to 'interactive_report_files/interactive_clathrate_plot.html'
# --- 2. Save the final `plot_data` table ---
# Wrap the final `plot_data` table in a DT widget with export buttons
final_data_table <- datatable(
plot_data,
caption = "Final Processed Data for Clathrate Structures",
extensions = 'Buttons',
options = list(dom = 'Bfrtip', buttons = c('copy', 'csv', 'excel'))
)
plot_table_filename <- file.path(export_dir, "final_clathrate_data.html")
saveWidget(final_data_table, file = plot_table_filename, selfcontained = TRUE)
cat(paste0(
"Interactive plot data table saved to '",
plot_table_filename,
"'\n"
))
## Interactive plot data table saved to 'interactive_report_files/final_clathrate_data.html'
cat(paste0("...Done. Check the '", export_dir, "' folder.\n"))
## ...Done. Check the 'interactive_report_files' folder.